Motivated by a broad range of potential applications, we address the [...]
[...] show its consistency under a minimum of conditions. Our approach
builds on the methodology developed in recent years for prediction of
individual sequences and exploits the quantile structure as a minimizer
of the so-called pinball loss function. We perform an in-depth analysis
[...] generally outperforms standard quantile prediction methods.

Index Terms: Time series, quantile prediction, pinball loss, sequential
prediction, nearest neighbor estimation, consistency, expert aggregation.

AMS 2000 Classification: 62G08; 62G05; 62G20. 

* Partially supported by the French "Agence Nationale pour la Recherche" under 
grant ANR-09-BLAN-0051-02 "CLARA". Research carried out within the INRIA project 
"CLASSIC" hosted by Ecole Normale Superieure and CNRS. 

† Corresponding author.






1 Introduction 



Forecasting the future values of an observed time series is an important prob- 
lem, which has been an area of considerable activity in recent years. The 
application scope is vast, as time series prediction applies to many fields, 
including problems in genetics, medical diagnoses, air pollution forecasting, 
machine condition monitoring, financial investments, production planning, 
sales forecasting and stock controls. 

To fix the mathematical context, suppose that at each time instant
$n = 1, 2, \ldots$, the forecaster (also called the predictor hereafter) is
asked to guess the next outcome $y_n$ of a sequence of real numbers
$y_1, y_2, \ldots$ with knowledge of the past
$y_1^{n-1} = (y_1, \ldots, y_{n-1})$ (where $y_1^0$ denotes the empty string).
Formally, the strategy of the predictor is a sequence
$g = \{g_n\}_{n=1}^{\infty}$ of forecasting functions

$$g_n : \mathbb{R}^{n-1} \to \mathbb{R}$$

and the prediction formed at time $n$ is just $g_n(y_1^{n-1})$. Throughout the
paper we will suppose that $y_1, y_2, \ldots$ are realizations of random
variables $Y_1, Y_2, \ldots$ such that the stochastic process
$\{Y_n\}_{-\infty}^{\infty}$ is jointly stationary and ergodic.

Many of the statistical techniques used in time series prediction are those of
regression analysis, such as classical least squares theory, or are adaptations
or analogues of them. These forecasting schemes are typically concerned with
finding a function $g_n$ such that the prediction $g_n(Y_1^{n-1})$ corresponds
to the conditional mean of $Y_n$ given the past sequence $Y_1^{n-1}$, or closely
related quantities. Many methods have been developed for this purpose, ranging
from parametric approaches such as AR($p$) and ARMA($p$,$q$) processes
(Brockwell and Davis [1]) to more involved nonparametric methods (see for
example Györfi et al. [2] and Bosq [3] for a review and references).

On the other hand, while these estimates of the conditional mean serve their
purpose, there exists a large area of problems where the forecaster is more
interested in estimating conditional quantiles and prediction intervals, in
order to know other features of the conditional distribution. There is now a
fast-growing literature on quantile regression (see Gannoun, Saracco and Yu [4]
for an overview and references) and considerable practical experience with
forecasting methods based on this theory. Economics makes a persuasive case for
the value of going beyond models for the conditional mean (Koenker and Hallock
[5]). In financial mathematics and financial risk management, quantile
regression is intimately linked to the $\tau$-Value at Risk (VaR), which is
defined as the $(1-\tau)$-quantile of the portfolio. For example, if a portfolio
of stocks has a one-day 5%-VaR of €1 million, there is a 5% probability that the
portfolio will fall in value by more than €1 million over a one-day period
(Duffie and Pan [6]). More generally, quantile regression methods have been
deployed in social sciences, ecology, medicine and manufacturing process
management. For a description, practical guide and extensive list of references
on these methods and related methodologies, we refer the reader to the monograph
of Koenker [7].

Motivated by this broad range of potential applications, we address in this
paper the quantile prediction problem of real-valued time series. Our approach
is nonparametric in spirit and breaks with at least three aspects of more
traditional procedures. First, we do not require the series to satisfy the
classical statistical assumptions for bounded, autoregressive or Markovian
processes. Indeed, our goal is to show powerful consistency results under a
strict minimum of conditions. Secondly, building on the methodology developed in
recent years for prediction of individual sequences, we present a sequential
quantile forecasting model based on the combination of a set of elementary
nearest neighbor-type predictors called "experts". The paradigm of prediction
with expert advice was first introduced in the theory of machine learning as a
model of online learning in the 1980s and early 1990s, and it has been
extensively investigated ever since (see the monograph of Cesa-Bianchi and
Lugosi [8] for a comprehensive introduction to the domain). Finally, in contrast
to standard nonparametric approaches, we attack the problem by fully exploiting
the quantile structure as a minimizer of the so-called pinball loss function
(Koenker and Bassett [9]).

The paper is organized as follows. After recalling some basic facts in Section
2, we present in Section 3 our expert-based quantile prediction procedure and
state its consistency under a minimum of conditions. We perform an in-depth
analysis of real-world data sets and show that the nonparametric strategy we
propose is faster and generally outperforms traditional methods in terms of
average prediction errors (Section 4). Proofs of the results are postponed to
Section 5.



2 Consistent quantile prediction



2.1 Notation and basic definitions 

Let $Y$ be a real-valued random variable with distribution function $F_Y$, and
let $\tau \in (0, 1)$. Recall that the generalized inverse of $F_Y$

$$F_Y^{\leftarrow}(\tau) = \inf\{t \in \mathbb{R} : F_Y(t) \geq \tau\}$$

is called the quantile function of $F_Y$ and that the real number
$q_\tau = F_Y^{\leftarrow}(\tau)$ defines the $\tau$-quantile of $F_Y$ (or $Y$).
The basic strategy behind quantile estimation arises from the observation that
minimizing the $\ell_1$-loss function yields the median. Koenker and Bassett [9]
generalized this idea and characterized the $\tau$-quantile by tilting the
absolute value function in a suitable fashion.

Lemma 2.1 Let $Y$ be an integrable real-valued random variable and, for
$\tau \in (0, 1)$, let the map

$$\rho_\tau(y) = y\left(\tau - \mathbf{1}_{[y \leq 0]}\right).$$

Then the quantile $q_\tau$ satisfies the property

$$q_\tau \in \arg\min_{q \in \mathbb{R}} \mathbb{E}\left[\rho_\tau(Y - q)\right]. \tag{2.1}$$

Moreover, if $F_Y$ is (strictly) increasing, then the minimum is unique, that is

$$\{q_\tau\} = \arg\min_{q \in \mathbb{R}} \mathbb{E}\left[\rho_\tau(Y - q)\right].$$

We have not been able to find a complete proof of this result, so we give a
short one in Section 5. The function $\rho_\tau$, shown in Figure 1, is called
the pinball function. For example, for $\tau = 1/2$, it yields back (up to a
factor 1/2) the absolute value function and, in this case, Lemma 2.1 just
expresses the fact that the median $q_{1/2} = F_Y^{\leftarrow}(1/2)$ is a
solution of the minimization problem

$$q_{1/2} \in \arg\min_{q \in \mathbb{R}} \mathbb{E}|Y - q|.$$
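As a small illustration of Lemma 2.1 on a finite sample, the following Python sketch evaluates the empirical pinball risk of constant predictions and picks the best one. The function names `pinball_loss` and `empirical_risk` and the sample values are illustrative assumptions, not part of the paper.

```python
def pinball_loss(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1[u <= 0])."""
    indicator = 1.0 if u <= 0 else 0.0
    return u * (tau - indicator)


def empirical_risk(sample, q, tau):
    """Average pinball loss of the constant prediction q over the sample."""
    return sum(pinball_loss(y - q, tau) for y in sample) / len(sample)


# Hypothetical sample, chosen only for illustration.
sample = [2.0, 7.0, 1.0, 9.0, 4.0]
tau = 0.5

# The empirical risk is piecewise linear and convex in q with kinks at the
# sample points, so a minimizer can be found by scanning the sample itself.
best_q = min(sample, key=lambda q: empirical_risk(sample, q, tau))
```

With $\tau = 1/2$ the selected value `best_q` is the sample median, in line with the remark above on the absolute value function.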

These definitions may be readily extended to pairs $(X, Y)$ of random variables
with conditional distribution $F_{Y|X}$. In this case, the conditional quantile
$q_\tau(X)$ is the measurable function of $X$ almost surely (a.s.) defined by

$$q_\tau(X) = F_{Y|X}^{\leftarrow}(\tau) = \inf\{t \in \mathbb{R} : F_{Y|X}(t) \geq \tau\},$$

and, as in Lemma 2.1, it can be shown that for an integrable $Y$

$$q_\tau(X) \in \arg\min_{q(\cdot)} \mathbb{E}_{P_{Y|X}}\left[\rho_\tau\left(Y - q(X)\right)\right], \tag{2.2}$$

where the minimum is taken over the set of all measurable real-valued functions
and the notation $\mathbb{E}_{P_{Y|X}}$ stands for the conditional expectation
of $Y$ with respect to $X$. We note again that if $F_{Y|X}$ is a.s. increasing,
then the solution of (2.2) is unique and equals $q_\tau(X)$ a.s. In the sequel,
we will denote by $Q_\tau(F_{Y|X})$ the set of solutions of the minimization
problem (2.2), so that $q_\tau(X) \in Q_\tau(F_{Y|X})$ and
$\{q_\tau(X)\} = Q_\tau(F_{Y|X})$ when the minimum is unique.

Figure 1: Pinball loss function $\rho_\tau$.

2.2 Quantile prediction

In our sequential version of the quantile prediction problem, the forecaster
observes one after another the realizations $y_1, y_2, \ldots$ of a stationary
and ergodic random process $Y_1, Y_2, \ldots$ At each time $n = 1, 2, \ldots$,
before the $n$-th value of the sequence is revealed, the task is to guess the
value of the conditional quantile

$$q_\tau(Y_1^{n-1}) = F^{\leftarrow}_{Y_n|Y_1^{n-1}}(\tau) = \inf\{t \in \mathbb{R} : F_{Y_n|Y_1^{n-1}}(t) \geq \tau\},$$

on the basis of the previous $n - 1$ observations
$Y_1^{n-1} = (Y_1, \ldots, Y_{n-1})$ only. Thus, formally, the strategy of the
predictor is a sequence $g = \{g_n\}_{n=1}^{\infty}$ of quantile prediction
functions

$$g_n : \mathbb{R}^{n-1} \to \mathbb{R}$$

and the prediction formed at time $n$ is just $g_n(y_1^{n-1})$. After $n$ time
instants, the (normalized) cumulative quantile loss on the string $Y_1^n$ is

$$L_n(g) = \frac{1}{n} \sum_{t=1}^{n} \rho_\tau\left(Y_t - g_t(Y_1^{t-1})\right).$$
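The normalized cumulative quantile loss can be sketched in a few lines of Python. In this sketch a strategy is represented as a function of the observed past only; the names `cumulative_loss` and `last_value`, the toy series and the convention of predicting 0 on an empty past are assumptions made for illustration.

```python
def pinball_loss(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1[u <= 0])."""
    return u * (tau - (1.0 if u <= 0 else 0.0))


def cumulative_loss(series, strategy, tau):
    """Normalized cumulative pinball loss of a sequential strategy.

    At step t the strategy only sees series[:t], i.e. the values revealed
    before the t-th outcome.
    """
    total = 0.0
    for t, outcome in enumerate(series):
        prediction = strategy(series[:t])
        total += pinball_loss(outcome - prediction, tau)
    return total / len(series)


def last_value(past):
    """Toy strategy: repeat the last observation (0.0 on an empty past)."""
    return past[-1] if past else 0.0


loss = cumulative_loss([1.0, 2.0, 2.0, 0.0], last_value, tau=0.5)
```

Because the strategy receives `series[:t]` and nothing else, the sketch respects the sequential constraint that the forecast at time $n$ depends on $y_1^{n-1}$ only.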

Ideally, the goal is to make $L_n(g)$ small. There is, however, a fundamental
limit for the quantile predictability, which is determined by a result of Algoet
[10]: for any quantile prediction strategy $g$ and jointly stationary ergodic
process $\{Y_n\}_{-\infty}^{\infty}$,

$$\liminf_{n \to \infty} L_n(g) \geq L^* \quad \text{a.s.}, \tag{2.3}$$

where

$$L^* = \mathbb{E}\left[\min_{q(\cdot)} \mathbb{E}_{P_{Y_0|Y_{-\infty}^{-1}}}\left[\rho_\tau\left(Y_0 - q(Y_{-\infty}^{-1})\right)\right]\right]$$

is the expected minimal quantile loss over all quantile estimations of $Y_0$
based on the infinite past observation sequence
$Y_{-\infty}^{-1} = (\ldots, Y_{-2}, Y_{-1})$. Generally, we cannot hope to
design a strategy whose prediction error exactly achieves the lower bound $L^*$.
Rather, we require that $L_n(g)$ gets arbitrarily close to $L^*$ as $n$ grows.
This gives sense to the following definition:

Definition 2.1 A quantile prediction strategy $g$ is called consistent with
respect to a class $\mathcal{C}$ of stationary and ergodic processes
$\{Y_n\}_{-\infty}^{\infty}$ if for each process in the class,

$$\lim_{n \to \infty} L_n(g) = L^* \quad \text{a.s.}$$

Thus, consistent strategies asymptotically achieve the best possible loss for
all processes in the class. In the context of prediction with squared loss,
Györfi and Lugosi [11], Nobel [12], Györfi and Ottucsák [13] and Biau et al.
[14] study various sequential prediction strategies, and state their
consistency under a minimum of assumptions on the collection $\mathcal{C}$ of
stationary and ergodic processes. Roughly speaking, these methods consider
several "simple" nonparametric estimates (called experts in this context) and
combine them at time $n$ according to their past performance. For this, a
probability distribution on the set of experts is generated, where a "good"
expert has relatively large weight, and the average of all experts' predictions
is taken with respect to this distribution. Interestingly, related schemes have
been proposed in the context of sequential investment strategies for financial
markets. Sequential investment strategies are allowed to use information about
the market collected from the past and determine at the beginning of a trading
period a portfolio, that is, a way to distribute the current capital among the
available assets. Here, the goal of the investor is to maximize his wealth in
the long run, without knowing the underlying distribution generating the stock
prices. For more information on this subject, we refer the reader to Algoet
[15], Györfi and Schäfer [16], Györfi, Lugosi and Udina [17], and Györfi, Udina
and Walk [18].

Our purpose in this paper will be to investigate an expert-oriented strategy
for quantile forecasting. With this aim in mind, we define in the next section
a quantile prediction strategy, called nearest neighbor-based strategy, and
state its consistency with respect to a large class of stationary and ergodic
processes.



3 A nearest neighbor-based strategy 

The quantile prediction strategy is defined at each time instant as a convex
combination of elementary predictors (the so-called experts), where the
weighting coefficients depend on the past performance of each elementary
predictor. To be more precise, we first define an infinite array of experts
$h_n^{(k,\ell)}$, where $k$ and $\ell$ are positive integers. The integer $k$ is
the length of the past observation vectors being scanned by the elementary
expert and, for each $\ell$, choose $p_\ell \in (0, 1)$ such that

$$\lim_{\ell \to \infty} p_\ell = 0,$$

and set

$$\bar{\ell} = \lfloor p_\ell n \rfloor$$

(where $\lfloor \cdot \rfloor$ is the floor function). At time $n$, for fixed $k$
and $\ell$ ($n > k + \bar{\ell} + 1$), the expert searches for the $\bar{\ell}$
nearest neighbors (NN) of the last seen observation $y_{n-k}^{n-1}$ in the past
and predicts the quantile accordingly. More precisely, let

$$J_n^{(k,\ell)} = \left\{k < t < n : y_{t-k}^{t-1} \text{ is among the } \bar{\ell}\text{-NN of } y_{n-k}^{n-1} \text{ in } y_1^k, \ldots, y_{n-k-1}^{n-2}\right\},$$

and define the elementary predictor $\bar{h}_n^{(k,\ell)}$ by

$$\bar{h}_n^{(k,\ell)} \in \arg\min_{q \in \mathbb{R}} \sum_{t \in J_n^{(k,\ell)}} \rho_\tau(y_t - q)$$

if $n > k + \bar{\ell} + 1$, and 0 otherwise. Next, let the truncation function

$$T_a(z) = \begin{cases} a & \text{if } z > a; \\ z & \text{if } |z| \leq a; \\ -a & \text{if } z < -a, \end{cases}$$

and let

$$h_n^{(k,\ell)} = T_{\min(n^\delta, \ell)} \circ \bar{h}_n^{(k,\ell)}, \tag{3.1}$$

where $\delta$ is a positive parameter to be fixed later on. We note that the
expert $h_n^{(k,\ell)}$ can be interpreted as a (truncated) $\bar{\ell}$-nearest
neighbor regression function estimate drawn in $\mathbb{R}^k$ (Györfi et al.
[19]). The proposed quantile prediction algorithm proceeds with an exponential
weighting average of the experts. More formally, let $\{b_{k,\ell}\}$ be a
probability distribution on the set of all pairs $(k, \ell)$ of positive
integers such that for all $k$ and $\ell$, $b_{k,\ell} > 0$. Fix a learning
parameter $\eta_n > 0$, and define the weights

$$w_{k,\ell,n} = b_{k,\ell}\, e^{-\eta_n (n-1) L_{n-1}(h^{(k,\ell)})}$$

and their normalized values

$$p_{k,\ell,n} = \frac{w_{k,\ell,n}}{\sum_{i,j=1}^{\infty} w_{i,j,n}}.$$

The quantile prediction strategy $g$ at time $n$ is defined by

$$g_n(y_1^{n-1}) = \sum_{k,\ell=1}^{\infty} p_{k,\ell,n}\, h_n^{(k,\ell)}(y_1^{n-1}), \qquad n = 1, 2, \ldots \tag{3.2}$$
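The exponential weighting step can be sketched as follows for a finite set of experts. This is a minimal illustration, not the authors' implementation: the function name `aggregate`, the dictionary layout keyed by $(k, \ell)$ and the numerical values are assumptions.

```python
import math


def aggregate(predictions, cumulative_losses, prior, eta):
    """Convex combination of expert forecasts with exponential weights.

    predictions[e]       : forecast of expert e for the next value
    cumulative_losses[e] : summed pinball loss of expert e so far,
                           i.e. (n - 1) * L_{n-1}(h_e)
    prior[e]             : prior weight b_e > 0, summing to one
    eta                  : learning parameter eta_n > 0
    """
    weights = {e: prior[e] * math.exp(-eta * cumulative_losses[e])
               for e in predictions}
    total = sum(weights.values())
    normalized = {e: w / total for e, w in weights.items()}
    forecast = sum(normalized[e] * predictions[e] for e in predictions)
    return forecast, normalized


# Two hypothetical experts indexed by (k, l), with a uniform prior.
experts = {(1, 1): 10.0, (1, 2): 14.0}
losses = {(1, 1): 0.0, (1, 2): math.log(3.0)}
prior = {(1, 1): 0.5, (1, 2): 0.5}
forecast, p = aggregate(experts, losses, prior, eta=1.0)
```

In this toy configuration the expert with the smaller past loss receives weight 3/4 and the forecast is pulled towards its prediction.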



The idea of combining a collection of concurrent estimates was originally
developed in a non-stochastic context for online sequential prediction from
deterministic sequences (Cesa-Bianchi and Lugosi [8]). Following the
terminology of the prediction literature, the combination of different
procedures is sometimes termed aggregation in the stochastic context. The
overall goal is always the same: use aggregation to improve prediction. For a
recent review and an updated list of references, see Bunea and Nobel [20].

In order to state consistency of the method, we shall impose the following 
set of assumptions: 

(H1) One has $\mathbb{E}[Y_0^2] < \infty$.

(H2) For any vector $s \in \mathbb{R}^k$, the random variable $\|Y_1^k - s\|$
has a continuous distribution function.

(H3) The conditional distribution function $F_{Y_0|Y_{-\infty}^{-1}}$ is a.s.
increasing.

Condition (H2) expresses the fact that ties occur with probability zero. A
discussion on how to deal with ties that may appear in some cases can be found
in [21], in the related context of portfolio selection strategies. Condition
(H3) is mainly technical and ensures that the minimization problem (2.2) has a
unique solution or, put differently, that the set
$Q_\tau(F_{Y_0|Y_{-\infty}^{-1}})$ reduces to the singleton
$\{F^{\leftarrow}_{Y_0|Y_{-\infty}^{-1}}(\tau)\}$.

We are now in a position to state the main result of the paper.

Theorem 3.1 Let $\mathcal{C}$ be the class of all jointly stationary ergodic
processes $\{Y_n\}_{-\infty}^{\infty}$ satisfying conditions (H1)-(H3). Suppose
in addition that $n\eta_n \to \infty$ and $n^{2\delta}\eta_n \to 0$ as
$n \to \infty$. Then the nearest neighbor quantile prediction strategy defined
above is consistent with respect to $\mathcal{C}$.

The truncation index $T$ in definition (3.1) of the elementary expert
$h_n^{(k,\ell)}$ is merely a technical choice that avoids having to assume that
$|Y_0|$ is a.s. bounded. On the practical side, it has little influence on
results for relatively short time series. On the other hand, the choice of the
learning parameter $\eta_n$ as $1/\sqrt{n}$ ensures consistency of the method
for $0 < \delta < 1/4$.
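The admissible range of $\delta$ can be checked directly. The short computation below is a sketch that reads the two conditions of Theorem 3.1 as $n\eta_n \to \infty$ and $n^{2\delta}\eta_n \to 0$, and substitutes $\eta_n = n^{-1/2}$.

```latex
\begin{align*}
n\,\eta_n &= n \cdot n^{-1/2} = n^{1/2} \longrightarrow \infty, \\
n^{2\delta}\,\eta_n &= n^{2\delta - 1/2} \longrightarrow 0
  \iff 2\delta - \tfrac{1}{2} < 0
  \iff \delta < \tfrac{1}{4}.
\end{align*}
```

Together with the requirement that $\delta$ be positive, this gives the range $0 < \delta < 1/4$.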

4 Experimental results 
4.1 Algorithmic settings 

In this section, we evaluate the behavior of the nearest neighbor quantile 
prediction strategy on real-world data sets and compare its performances to 
those of standard families of methods on the same data sets. 

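Before testing the different procedures, some details on the computational
aspects of the presented method are in order. We first note that infinite sums
make formula (3.2) impracticable. Thus, for practical reasons, we chose a
finite grid $(k, \ell) \in \mathcal{K} \times \mathcal{L}$ of experts (positive
integers), let

$$g_n(y_1^{n-1}) = \sum_{k \in \mathcal{K},\, \ell \in \mathcal{L}} p_{k,\ell,n}\, h_n^{(k,\ell)}(y_1^{n-1}), \qquad n = 1, 2, \ldots \tag{4.1}$$

and fixed the probability distribution $\{b_{k,\ell}\}$ as the uniform
distribution over the $|\mathcal{K}| \times |\mathcal{L}|$ experts. Observing
that $h_n^{(k,\ell_1)} = h_n^{(k,\ell_2)}$ and $b_{k,\ell_1} = b_{k,\ell_2}$
whenever $\bar{\ell}_1 = \bar{\ell}_2$, formula (4.1) may be more conveniently
rewritten as a sum over $k \in \mathcal{K}$ and
$\bar{\ell} \in \bar{\mathcal{L}}$, where
$\bar{\mathcal{L}} = \{\bar{\ell} : \ell \in \mathcal{L}\}$. In all subsequent
numerical experiments, we chose $\mathcal{K} = \{1, 2, 3, \ldots, 14\}$ and
$\bar{\mathcal{L}} = \{1, 2, 3, \ldots, 25\}$.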

Next, as indicated by the theoretical results, we fixed $\eta_n = \sqrt{1/n}$.
For a thorough discussion on the best practical choice of $\eta_n$ we refer to
[8]. To avoid numerical instability problems while computing the
$p_{k,\ell,n}$, we applied if necessary a simple linear transformation on all
$L_n(h_n^{(k,\ell)})$, just to force these quantities to belong to an interval
where the effective computation of $x \mapsto \exp(-x)$ is numerically stable.
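One common way to obtain such stability is to subtract the smallest loss before exponentiating. The paper does not specify which linear transformation was used, so the shift below is an assumption, as are the function name `stable_weights` and the uniform prior.

```python
import math


def stable_weights(cumulative_losses, eta):
    """Normalized exponential weights under a uniform prior.

    Subtracting the smallest loss multiplies every unnormalized weight by
    the same positive constant, so the normalized weights are unchanged
    while the largest term becomes exp(0) = 1.
    """
    shift = min(cumulative_losses)
    raw = [math.exp(-eta * (loss - shift)) for loss in cumulative_losses]
    total = sum(raw)
    return [w / total for w in raw]


# Without the shift, exp(-5000) underflows to 0.0 and the normalization
# would divide by zero.
weights = stable_weights([5000.0, 5000.0 + math.log(3.0)], eta=1.0)
```

The shift changes nothing mathematically, since the common factor cancels in the ratio defining the normalized weights.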

Finally, in order to deal with the computation of the elementary experts (3.1),
we denote by $\lceil \cdot \rceil$ the ceiling function and observe that if
$m \times \tau$ is not an integer, then the solution of the minimization
problem $\arg\min_{b \in \mathbb{R}} \sum_{i=1}^{m} \rho_\tau(y_i - b)$ is
unique and equals the $\lceil m \times \tau \rceil$-th element in the sorted
sample list. On the other hand, if $m \times \tau$ is an integer then the
minimum is not unique, but the $m \times \tau$-th element in the sorted
sequence may be chosen as a minimizer. Thus, practically speaking, each
elementary expert is computed by sorting the sample. The complexity of this
operation is $O(\bar{\ell} \log(\bar{\ell}))$; it is almost linear and feasible
even for large values of $\bar{\ell}$. For a more involved discussion, we refer
the reader to Koenker [7].
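The sorting rule and the resulting elementary expert can be sketched as follows. This is an illustration under stated assumptions rather than the authors' C# code: the names `sorted_quantile` and `nn_expert`, the Euclidean distance, the direct `n_neighbors` argument (instead of $\lfloor p_\ell n \rfloor$) and the `cap` truncation level are all hypothetical choices.

```python
import math


def sorted_quantile(sample, tau):
    """Return the ceil(m * tau)-th smallest element (1-indexed) of the sample."""
    m = len(sample)
    rank = max(1, math.ceil(m * tau))
    return sorted(sample)[rank - 1]


def nn_expert(past, k, n_neighbors, tau, cap):
    """Truncated nearest-neighbor quantile forecast from the observed past.

    past        : observations y_1, ..., y_{n-1}
    k           : length of the context window
    n_neighbors : number of nearest past contexts to keep
    cap         : truncation level a of the map T_a
    """
    n = len(past) + 1
    if n <= k + n_neighbors + 1:
        return 0.0
    context = past[-k:]
    candidates = []
    # The window past[t-k:t] is followed by the value past[t].
    for t in range(k, len(past)):
        window = past[t - k:t]
        candidates.append((math.dist(window, context), past[t]))
    candidates.sort(key=lambda pair: pair[0])
    followers = [value for _, value in candidates[:n_neighbors]]
    raw = sorted_quantile(followers, tau)
    return max(-cap, min(cap, raw))


# Periodic toy series: the context (1, 2) was always followed by 3.
past = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0]
prediction = nn_expert(past, k=2, n_neighbors=2, tau=0.5, cap=100.0)
```

In floating point, the product `m * tau` can land slightly above an integer, so a production implementation might round it before taking the ceiling; the sketch leaves this out for brevity.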

All algorithms have been implemented using the object-oriented language
C# 3.0 and .NET Framework 3.5.



4.2 Data sets and results 

We investigated 21 real-world time series representing daily call volumes
entering call centers. Optimizing the staff level is one of the most difficult
and important tasks for a call center manager. Indeed, if the staff is
oversized, then most of the employees will be inactive. On the other hand,
underestimating the staff may lead to long waiting phone queues of customers.
Thus, in order to know the right staff level, the manager needs to forecast the
call volume series and, to get a more accurate staff level planning, he has to
forecast the quantiles of the series.

In our data set the series had on average 760 points, ranging from 383 for the
shortest to 826 for the longest. Four typical series are shown in Figure 2.

We used a set $\mathcal{D}$ of selected dates $m < n$ and, for each method and
each time series $(y_1, \ldots, y_n)$ (here, $n = 760$ on average), we trained
the models on the pruned series $(y_1, \ldots, y_m)$ and predicted the
$\tau$-quantile at time $m + 1$. The set $\mathcal{D}$ is composed of 91 dates,
so that all quality criteria used to measure the proximity between the
predicted quantiles and the observed values $y_{m+1}$ were computed using
91 x 21 = 1911 points. The 21 time series and the set $\mathcal{D}$ are
available at the address http://www.lsta.upmc.fr/doct/patra/.

Figure 2: Four call center series, out of 21.

In a first series of experiments, we let the methods predict the
$\tau$-quantiles at the 1911 dates for $\tau \in \{0.1, 0.5, 0.9\}$. We
compared the performances of our expert-based strategy, denoted hereafter by
QuantileExpertMixture$_\tau$, with those of QAR($p$)$_\tau$, a $\tau$-quantile
linear autoregressive model of order $p$. This quantile prediction model, which
is described in [7], also uses the pinball criterion to fit its parameters. The
implementation we used solves the minimization problem with an Iterative
Re-weighted Least Square algorithm (IRLS), see for instance Street, Carroll and
Ruppert [22]. Following Takeuchi, Le, Sears and Smola [23], we used two
criteria to measure the quality of the overall set of quantile forecasts.
First, we evaluated the expected risk with respect to the pinball function
$\rho_\tau$, referred to as PinBall Loss in the sequel. Secondly we calculated
Ramp Loss, the empirical fraction of observed values $y_{m+1}$ which exceed the
quantile estimates. Ideally, the value of Ramp Loss should be close to
$1 - \tau$.
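Both criteria are simple averages over the forecast dates and can be sketched as below. The sketch counts observations that fall strictly above the predicted quantile, which is the reading under which the target value is $1 - \tau$; this convention, the function names and the toy numbers are assumptions.

```python
def pinball_score(observed, predicted, tau):
    """Average pinball loss of quantile forecasts against observed values."""
    total = 0.0
    for y, q in zip(observed, predicted):
        u = y - q
        total += u * (tau - (1.0 if u <= 0 else 0.0))
    return total / len(observed)


def ramp_score(observed, predicted):
    """Fraction of observations strictly above the predicted quantile."""
    above = sum(1 for y, q in zip(observed, predicted) if y > q)
    return above / len(observed)


# Hypothetical observed values and a constant 0.1-quantile forecast.
observed = [10.0, 12.0, 8.0, 15.0]
predicted = [9.0, 9.0, 9.0, 9.0]
pinball = pinball_score(observed, predicted, tau=0.1)
ramp = ramp_score(observed, predicted)
```

A well-calibrated forecaster for $\tau = 0.1$ would leave roughly 90% of the observations above its predictions, so a ramp value far from 0.9 signals a biased quantile estimate even when the pinball score looks acceptable.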

Tables 1 to 3 show the QuantileExpertMixture$_\tau$ and QAR($p$)$_\tau$ results
at the selected dates $\mathcal{D}$ of the call center series. The latter
algorithm was benchmarked for each order $p$ in $\{1, \ldots, 10\}$, but we
reported only the most accurate order $p = 7$. The best results with respect to
each criterion are shown in bold. We see that both methods perform roughly
similarly, with possibly a slight advantage for the autoregressive strategy for
$\tau = 0.1$, whereas QuantileExpertMixture$_\tau$ does better for
$\tau = 0.9$.



Method                           PinBall Loss (0.1)   Ramp Loss
-------------------------------  -------------------  ---------
QuantileExpertMixture$_{0.1}$    13.71                0.80
QAR(7)$_{0.1}$                   13.22                0.88

Table 1: Quantile forecastings with $\tau = 0.1$.

Method                           PinBall Loss (0.5)   Ramp Loss
-------------------------------  -------------------  ---------
QuantileExpertMixture$_{0.5}$    24.05                0.42
QAR(7)$_{0.5}$                   29.157               0.47

Table 2: Quantile forecastings with $\tau = 0.5$.

Method                           PinBall Loss (0.9)   Ramp Loss
-------------------------------  -------------------  ---------
QuantileExpertMixture$_{0.9}$    12.27                0.07
QAR(7)$_{0.9}$                   19.31                0.07

Table 3: Quantile forecastings with $\tau = 0.9$.



Median-based predictors are well known for their robustness while predicting
individual values for time series, see for instance Hall, Peng and Yao [24].
Therefore, in a second series of experiments, we fixed $\tau = 0.5$ and focused
on the problem of predicting future outcomes of the series. We decided to
compare the results of QuantileExpertMixture$_{0.5}$ with those of 6 competing
predictive procedures:

• MA denotes the simple moving average model. 

• AR(p) is a linear autoregressive model of order p, with parameters computed 
with respect to the usual least square criterion. 

• QAR(p) is the $\tau$-quantile linear autoregressive model of order p
described earlier.

• DayOfTheWeekMA is a naive model, which applies moving averages on the
days of the week, that is a moving average on the Sundays, Mondays, and
so on.

• MeanExpertMixture is an online prediction algorithm described in [14]. It
is based on conditional mean estimation and close in spirit to the strategy
QuantileExpertMixture$_{0.5}$.

• And finally, we let HoltWinters be the well-known procedure which
performs exponential smoothing on three components of the series, namely
Level, Trend and Seasonality. For a thorough presentation of HoltWinters
techniques we refer the reader to Makridakis, Wheelwright and Hyndman [25].
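The two moving-average baselines in the list above can be sketched as follows. The paper does not give their window lengths, so the `window` argument, the weekly `period` of 7 and the function names are assumptions made purely for illustration.

```python
def moving_average(past, window):
    """Mean of the last `window` observations."""
    recent = past[-window:]
    return sum(recent) / len(recent)


def day_of_week_moving_average(past, window, period=7):
    """Mean of the last `window` observations sharing the weekday of the
    next point, assuming one observation per day and a weekly period."""
    # The next point has index len(past); points on the same weekday are
    # found by stepping back one period at a time.
    same_day = [past[i] for i in range(len(past) - period, -1, -period)]
    if not same_day:
        raise ValueError("need at least one full period of past data")
    recent = same_day[:window]
    return sum(recent) / len(recent)


# Two weeks of hypothetical daily values 1.0, ..., 14.0.
past = [float(v) for v in range(1, 15)]
ma = moving_average(past, window=7)
dow = day_of_week_moving_average(past, window=2)
```

On series with a strong weekly pattern, as is typical of call volumes, the second baseline avoids averaging weekdays with weekends.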

Accuracy of all forecasting methods was measured using the Average Absolute
Error (Avg Abs Error, which is proportional to the pinball error since
$\tau = 0.5$), Average Squared Error (Avg Sqr Error), and the unstable but
widely used criterion Mean Absolute Percentage Error (MAPE, see [25] for
definition and discussion). We also reported the figure Abs Std Dev which
corresponds to the empirical standard deviation of the differences
$|\hat{y}_t - y_t|$, where $\hat{y}_t$ stands for the forecasted value while
$y_t$ stands for the observed value of the time series at time $t$. AR($p$) and
QAR($p$) algorithms were run for each order $p$ in $\{1, \ldots, 10\}$, but we
reported only the most accurate orders.
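The four accuracy figures can be computed as in the sketch below. The population form of the standard deviation (division by the number of points) and the dictionary keys are assumptions; the MAPE as written requires nonzero observed values.

```python
import math


def forecast_metrics(observed, forecast):
    """Accuracy summaries of point forecasts against observed values."""
    abs_errors = [abs(f - y) for y, f in zip(observed, forecast)]
    n = len(abs_errors)
    avg_abs = sum(abs_errors) / n
    avg_sqr = sum(e * e for e in abs_errors) / n
    # MAPE is undefined when an observed value is zero.
    mape = 100.0 * sum(e / abs(y) for e, y in zip(abs_errors, observed)) / n
    # Standard deviation of the absolute errors around their mean.
    abs_std = math.sqrt(sum((e - avg_abs) ** 2 for e in abs_errors) / n)
    return {"avg_abs": avg_abs, "avg_sqr": avg_sqr,
            "mape": mape, "abs_std": abs_std}


# Hypothetical observed values and forecasts.
metrics = forecast_metrics([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```

For $\tau = 0.5$ the pinball loss of an error $u$ equals $|u|/2$, which is why the Average Absolute Error is proportional to the pinball error in this setting.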



Method                           Avg Abs Error   Avg Sqr Error   MAPE (%)   Abs Std Dev
-------------------------------  --------------  --------------  ---------  -----------
MA                               179.0           62448           52.0       174.8
AR(7)                            65.8            9738            31.6       73.5
QAR(8)$_{0.5}$                   57.8            9594            24.9       79.2
DayOfTheWeekMA                   54.1            7183            22.8       64.7
QuantileExpertMixture$_{0.5}$    48.1            5731            21.6       58.4
MeanExpertMixture                52.4            6536            22.3       61.6
HoltWinters                      49.8            6025            21.5       59.5

Table 4: Future outcomes forecastings.



We see via Table 4 that the nearest neighbor strategy presented here
outperforms all other methods in terms of Average Absolute Error.
Interestingly, this forecasting procedure also provides the best results with
respect to the Average Squared Error criterion. This is remarkable, since
QuantileExpertMixture$_{0.5}$ does not rely on a squared error criterion,
contrary to MeanExpertMixture. The same comment applies to QAR(8)$_{0.5}$ and
AR(7). In terms of the Mean Absolute Percentage Error, the present method and
the HoltWinters procedure provide good and broadly similar results.

5 Proofs 

5.1 Proof of Theorem 3.1

The following lemmas will be essential in the proof of Theorem 3.1. The first
one is known as Breiman's generalised ergodic theorem (Breiman [26]).

Lemma 5.1 Let $Z = \{Z_n\}_{-\infty}^{\infty}$ be a stationary and ergodic
process. For each positive integer $t$, let $T^t$ denote the left shift
operator, shifting any sequence of real numbers
$\{\ldots, z_{-1}, z_0, z_1, \ldots\}$ by $t$ digits to the left. Let
$\{f_t\}_{t=1}^{\infty}$ be a sequence of real-valued functions such that
$\lim_{t \to \infty} f_t(Z) = f(Z)$ a.s. for some function $f$. Suppose that
$\mathbb{E}[\sup_t |f_t(Z)|] < \infty$. Then

$$\lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} f_t(T^t Z) = \mathbb{E}[f(Z)] \quad \text{a.s.}$$

Lemma 5.2 below is due to Györfi and Ottucsák [13]. These authors proved the
inequality for any cumulative normalized loss of form
$L_n(h) = \frac{1}{n} \sum_{t=1}^{n} \ell_t(h)$, where
$\ell_t(h) = \ell_t(h_t, Y_t)$ is convex in its first argument, which is the
case for the function
$\ell_t(h_t, Y_t) = \rho_\tau(Y_t - h_t(Y_1^{t-1}))$.

Lemma 5.2 Let $g = \{g_n\}_{n=1}^{\infty}$ be the nearest neighbor quantile
prediction strategy defined in (3.2). Then, for every $n \geq 1$, a.s.,

$$L_n(g) \leq \inf_{k,\ell}\left(L_n(h^{(k,\ell)}) - \frac{2 \ln b_{k,\ell}}{n \eta_{n+1}}\right) + \frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \left[\rho_\tau\left(Y_t - h_t^{(k,\ell)}(Y_1^{t-1})\right)\right]^2.$$

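Lemma 5.3 Let $x, y \in \mathbb{R}$ and $\ell \in \mathbb{N}$. Then

1. $\rho_\tau(x) \leq |x|$.

2. $\rho_\tau(x + y) \leq \rho_\tau(x) + \rho_\tau(y)$.

3. $\rho_\tau(T_\ell(x) - T_\ell(y)) \leq \rho_\tau(x - y)$.

Proof of Lemma 5.3 Let $x, y \in \mathbb{R}$ and $\ell \in \mathbb{N}$.

1. We have

$$|\rho_\tau(x)| = \left|x\left(\tau - \mathbf{1}_{[x \leq 0]}\right)\right| = |x| \left|\tau - \mathbf{1}_{[x \leq 0]}\right| \leq |x|.$$

2. Clearly,

$$\rho_\tau(x + y) \leq \rho_\tau(x) + \rho_\tau(y) \iff x \mathbf{1}_{[x \leq 0]} + y \mathbf{1}_{[y \leq 0]} \leq x \mathbf{1}_{[x+y \leq 0]} + y \mathbf{1}_{[x+y \leq 0]}.$$

The conclusion follows by examining the different positions of $x$ and $y$ with
respect to 0.

3. If $x > \ell$ and $|y| \leq \ell$, then

$$\begin{aligned}
\rho_\tau(T_\ell(x) - T_\ell(y)) &= \rho_\tau(\ell - y) \\
&= (\ell - y)\left(\tau - \mathbf{1}_{[\ell - y \leq 0]}\right) \\
&= (\ell - y)\tau \\
&\leq (x - y)\tau \\
&= (x - y)\left(\tau - \mathbf{1}_{[x - y \leq 0]}\right) \\
&= \rho_\tau(x - y).
\end{aligned}$$

Similarly, if $x < -\ell$ and $|y| \leq \ell$, then

$$\begin{aligned}
\rho_\tau(T_\ell(x) - T_\ell(y)) &= \rho_\tau(-\ell - y) \\
&= (-\ell - y)\left(\tau - \mathbf{1}_{[-\ell - y \leq 0]}\right) \\
&= (-\ell - y)(\tau - 1) \\
&\leq (x - y)(\tau - 1) \\
&= (x - y)\left(\tau - \mathbf{1}_{[x - y \leq 0]}\right) \\
&= \rho_\tau(x - y).
\end{aligned}$$

All the other cases are similar and left to the reader.

□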
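The three inequalities of Lemma 5.3 can also be checked numerically on random inputs. The sketch below is a sanity check rather than a proof; the sampling ranges, the tolerance `eps` and the function names are arbitrary assumptions.

```python
import random


def pinball(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1[u <= 0])."""
    return u * (tau - (1.0 if u <= 0 else 0.0))


def truncate(z, a):
    """Truncation T_a(z): clip z to the interval [-a, a]."""
    return max(-a, min(a, z))


def check_lemma(trials=10000, seed=0):
    """Return True if no random draw violates the three inequalities."""
    rng = random.Random(seed)
    eps = 1e-9  # slack for floating-point rounding
    for _ in range(trials):
        tau = rng.uniform(0.01, 0.99)
        x = rng.uniform(-10.0, 10.0)
        y = rng.uniform(-10.0, 10.0)
        level = rng.randint(1, 5)
        if pinball(x, tau) > abs(x) + eps:
            return False
        if pinball(x + y, tau) > pinball(x, tau) + pinball(y, tau) + eps:
            return False
        clipped = truncate(x, level) - truncate(y, level)
        if pinball(clipped, tau) > pinball(x - y, tau) + eps:
            return False
    return True


lemma_holds = check_lemma()
```

The third inequality reflects the fact that truncation is monotone and 1-Lipschitz, so the truncated difference keeps the sign of $x - y$ and cannot be larger in absolute value.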

Recall that a sequence $\{\mu_n\}_{n=1}^{\infty}$ of probability measures on
$\mathbb{R}$ is defined to converge weakly to the probability measure
$\mu_\infty$ if for every bounded, continuous real function $f$,

$$\int f \, d\mu_n \to \int f \, d\mu_\infty \quad \text{as } n \to \infty.$$

Recall also that the sequence $\{\mu_n\}_{n=1}^{\infty}$ is said to be
uniformly integrable if

$$\lim_{a \to \infty} \sup_{n \geq 1} \int_{|x| > a} |x| \, d\mu_n(x) = 0.$$

Moreover, if

$$\sup_{n \geq 1} \int |x|^{1+\varepsilon} \, d\mu_n(x) < \infty$$

for some positive $\varepsilon$, then the sequence $\{\mu_n\}_{n=1}^{\infty}$
is uniformly integrable (Billingsley [27]).

The next lemma may be summarized by saying that if a sequence of probability 
measures converges in terms of weak convergence topology, then the associated 
quantile sequence will converge too. 

Lemma 5.4 Let $\{\mu_n\}_{n=1}^{\infty}$ be a uniformly integrable sequence of
real probability measures, and let $\mu_\infty$ be a probability measure with
(strictly) increasing distribution function. Suppose that
$\{\mu_n\}_{n=1}^{\infty}$ converges weakly to $\mu_\infty$. Then, for all
$\tau \in (0, 1)$,

$$q_{\tau,n} \to q_{\tau,\infty} \quad \text{as } n \to \infty,$$

where $q_{\tau,n} \in Q_\tau(\mu_n)$ for all $n \geq 1$ and
$\{q_{\tau,\infty}\} = Q_\tau(\mu_\infty)$.

Proof of Lemma 5.4 Since $\{\mu_n\}_{n=1}^{\infty}$ converges weakly to
$\mu_\infty$, it is a tight sequence. Consequently, there is a compact set, say
$[-M, M]$, such that
$\mu_n(\mathbb{R} \setminus [-M, M]) < \min(\tau, 1 - \tau)$. This implies
$q_{\tau,n} \in [-M, M]$ for all $n \geq 1$. Consequently, it will be enough to
prove that any convergent subsequence of $\{q_{\tau,n}\}_{n=1}^{\infty}$
converges towards $q_{\tau,\infty}$.

Using a slight abuse of notation, we still denote by
$\{q_{\tau,n}\}_{n=1}^{\infty}$ a convergent subsequence of the original
sequence, and let $q_{\tau,*}$ be such that
$\lim_{n \to \infty} q_{\tau,n} = q_{\tau,*}$. Using the assumption on the
distribution function of $\mu_\infty$, we know by Lemma 2.1 that
$q_{\tau,\infty}$ is the unique minimizer of problem (2.1). Therefore, to show
that $q_{\tau,*} = q_{\tau,\infty}$, it suffices to prove that, for any
$q \in \mathbb{R}$,

$$\mathbb{E}_{\mu_\infty}\left[\rho_\tau(Y - q)\right] \geq \mathbb{E}_{\mu_\infty}\left[\rho_\tau(Y - q_{\tau,*})\right].$$

Fix $q \in \mathbb{R}$. We first prove that

$$\mathbb{E}_{\mu_n}\left[\rho_\tau(Y - q)\right] \to \mathbb{E}_{\mu_\infty}\left[\rho_\tau(Y - q)\right] \quad \text{as } n \to \infty. \tag{5.1}$$

To see this, for $M > 0$ and all $y \in \mathbb{R}$, set

$$\rho_\tau^{(+,M)}(y) = \begin{cases}
0 & \text{if } |y| \leq M; \\
\rho_\tau(y) & \text{if } |y| > M + 1; \\
\rho_\tau(M + 1)(y - M) & \text{if } y \in [M, M + 1]; \\
-\rho_\tau(-M - 1)(y + M) & \text{if } y \in [-M - 1, -M].
\end{cases}$$

The function $\rho_\tau^{(+,M)}$ is continuous and, for all
$z \in \mathbb{R}$, satisfies the inequality
$\rho_\tau^{(+,M)}(z) \leq \rho_\tau(z) \mathbf{1}_{[|z| > M]}$. In the sequel,
we will denote by $\rho_\tau^{(-,M)}$ the bounded and continuous map
$\rho_\tau - \rho_\tau^{(+,M)}$. The decomposition
$\rho_\tau = \rho_\tau^{(+,M)} + \rho_\tau^{(-,M)}$ is illustrated in Figure 3.





Figure 3: Illustration of the decomposition
$\rho_\tau = \rho_\tau^{(+,M)} + \rho_\tau^{(-,M)}$.



Next, fix e > and choose Af large enough to ensure 

sup(E^„ [\Y - q\l[\Y-q\>M]\) [\y -q\HY-q\>M]] <e/2. 

n>l 



Choose also n sufficiently large to have 



E 



pi-^''\Y-q) 



E 



pi-^''\Y-q) <e/2 



Write 



|E;.„ [pr{Y-q)]-¥.,^ [pr{Y 



E„ 



E. 



pi''"\Y-q)_ 



+ E, 



p^-'''\Y-q) 



pi^'''>{Y-q) 



E„ 



\pi~'''\Y-q)] 
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Thus 



< 



< 



< 



E 



E 



/-t'OC 



E„ 



E„ 



E„ 



[pi-^'^'HY - q) 
'pi-'''\Y-q) 



+ |lE^n [Pt(5^ - g')l[|y-g|>A/]] I + l^fio. [Pt{Y - q)l[\Y-q\>M]] I 



E„ 



'pi-''^HY-q) 



E„ 



pi-'^'HY-q) 



+ SUpE^„ [|y - q\l[\Y~q\>M]] + E^oo [1^ - 9|l[ly-g|>Af]] 
n 

< £• 

for all large enough n. This shows ()5.ip . 

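Next, using the fact that the function $\rho_\tau$ is uniformly continuous, we
may write, for sufficiently large $n$ and all $y \in \mathbb{R}$,

$$\rho_\tau(y - q_{\tau,n}) \geq \rho_\tau(y - q_{\tau,*}) - \varepsilon. \tag{5.2}$$

Therefore, for all large enough $n$,

$$\begin{aligned}
\mathbb{E}_{\mu_\infty}\left[\rho_\tau(Y - q)\right]
&\geq \mathbb{E}_{\mu_n}\left[\rho_\tau(Y - q)\right] - \varepsilon && \text{(by identity (5.1))} \\
&\geq \mathbb{E}_{\mu_n}\left[\rho_\tau(Y - q_{\tau,n})\right] - \varepsilon \\
&\geq \mathbb{E}_{\mu_n}\left[\rho_\tau(Y - q_{\tau,*})\right] - 2\varepsilon && \text{(by inequality (5.2))} \\
&\geq \mathbb{E}_{\mu_\infty}\left[\rho_\tau(Y - q_{\tau,*})\right] - 3\varepsilon && \text{(by identity (5.1))}.
\end{aligned}$$

Letting $\varepsilon \to 0$ leads to the desired result.

□

We are now in a position to prove Theorem 3.1. Because of inequality (2.3) it
is enough to show that

$$\limsup_{n \to \infty} L_n(g) \leq L^* \quad \text{a.s.}$$

With this in mind, we first provide an upper bound on the first term of the
right hand side of the inequality in Lemma 5.2. We have

$$\begin{aligned}
\limsup_{n \to \infty} \inf_{k,\ell}\left(L_n(h^{(k,\ell)}) - \frac{2 \ln b_{k,\ell}}{n \eta_{n+1}}\right)
&\leq \inf_{k,\ell}\left(\limsup_{n \to \infty}\left(L_n(h^{(k,\ell)}) - \frac{2 \ln b_{k,\ell}}{n \eta_{n+1}}\right)\right) \\
&\leq \inf_{k,\ell}\left(\limsup_{n \to \infty} L_n(h^{(k,\ell)})\right).
\end{aligned}$$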

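To evaluate $\limsup_{n \to \infty} L_n(h^{(k,\ell)})$, we investigate the
performance of the expert $h^{(k,\ell)}$ on the stationary and ergodic sequence
$Y_0, Y_{-1}, Y_{-2}, \ldots$ Fix $p_\ell \in (0, 1)$, $s \in \mathbb{R}^k$,
and set $\tilde{\ell} = \lfloor p_\ell j \rfloor$, where $j$ is a positive
integer.

For $j > k + \tilde{\ell} + 1$, introduce the set

$$\tilde{J}_{j,s}^{(k,\tilde{\ell})} = \left\{-j + k + 1 \leq i \leq 0 : Y_{i-k}^{i-1} \text{ is among the } \tilde{\ell}\text{-NN of } s\right\}.$$

For any real number $a$, we denote by $\delta_a$ the Dirac (point) measure at
$a$. Let the random measure $P_{j,s}^{(k,\ell)}$ be defined by

$$P_{j,s}^{(k,\ell)} = \frac{1}{\left|\tilde{J}_{j,s}^{(k,\tilde{\ell})}\right|} \sum_{i \in \tilde{J}_{j,s}^{(k,\tilde{\ell})}} \delta_{Y_i}.$$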

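Take an arbitrary radius $r_{k,\ell}(s)$ such that

$$\mathbb{P}\left[\|Y_{-k}^{-1} - s\| \leq r_{k,\ell}(s)\right] = p_\ell.$$

A straightforward adaptation of an argument in Theorem 3.1 of [18] shows that

$$P_{j,s}^{(k,\ell)} \to P_{\infty,s}^{(k,\ell)} := P_{Y_0 \,|\, \|Y_{-k}^{-1} - s\| \leq r_{k,\ell}(s)} \quad \text{as } j \to \infty$$

almost surely in terms of weak convergence. Moreover, by a double application
of the ergodic theorem (see for instance [14]),

$$\int y^2 \, dP_{j,s}^{(k,\ell)}(y) \to \int y^2 \, dP_{\infty,s}^{(k,\ell)}(y) \quad \text{a.s. as } j \to \infty.$$

Thus

$$\sup_{j > 0} \int y^2 \, dP_{j,s}^{(k,\ell)}(y) < \infty \quad \text{a.s.}$$

and, consequently, the sequence $\{P_{j,s}^{(k,\ell)}\}_{j=1}^{\infty}$ is
uniformly integrable.

By assumption (H3), the distribution function of the measure
$P_{Y_0|Y_{-\infty}^{-1}}$ is a.s. increasing. We also have
$\sigma\left(\|Y_{-k}^{-1} - s\| \leq r_{k,\ell}(s)\right) \subset \sigma\left(Y_{-\infty}^{-1}\right)$,
where $\sigma(X)$ denotes the sigma algebra generated by the random variable
$X$. Thus the distribution function of $P_{\infty,s}^{(k,\ell)}$ is a.s.
increasing, too. Hence, letting

$$q_{\tau,j}^{(k,\ell)}(Y_{-j+1}^{-1}, s) \in Q_\tau\left(P_{j,s}^{(k,\ell)}\right) \quad \text{and} \quad \left\{q_{\tau,\infty}^{(k,\ell)}(s)\right\} = Q_\tau\left(P_{\infty,s}^{(k,\ell)}\right),$$

we may apply Lemma 5.4 and obtain

$$q_{\tau,j}^{(k,\ell)}(Y_{-j+1}^{-1}, s) \to q_{\tau,\infty}^{(k,\ell)}(s) \quad \text{a.s. as } j \to \infty.$$

Consequently, for any $y_0 \in \mathbb{R}$,

$$\rho_\tau\left(y_0 - T_{\min(j^\delta, \ell)}\left(q_{\tau,j}^{(k,\ell)}(Y_{-j+1}^{-1}, s)\right)\right) \to \rho_\tau\left(y_0 - T_\ell\left(q_{\tau,\infty}^{(k,\ell)}(s)\right)\right) \quad \text{a.s. as } j \to \infty.$$

Since $y_0$ and $s$ are arbitrary, we are led to

$$\rho_\tau\left(Y_0 - T_{\min(j^\delta, \ell)}\left(q_{\tau,j}^{(k,\ell)}(Y_{-j+1}^{-1}, Y_{-k}^{-1})\right)\right) \to \rho_\tau\left(Y_0 - T_\ell\left(q_{\tau,\infty}^{(k,\ell)}(Y_{-k}^{-1})\right)\right) \quad \text{a.s. as } j \to \infty. \tag{5.3}$$

For $y = (\ldots, y_{-1}, y_0, y_1, \ldots)$, set

$$f_j(y) = \rho_\tau\left(y_0 - h_j^{(k,\ell)}(y_{-j+1}^{-1})\right).$$

Clearly,

$$\begin{aligned}
|f_j(Y)| &= \left|\rho_\tau\left(Y_0 - h_j^{(k,\ell)}(Y_{-j+1}^{-1})\right)\right| \\
&\leq \left|Y_0 - T_{\min(j^\delta, \ell)}\left(\bar{h}_j^{(k,\ell)}(Y_{-j+1}^{-1})\right)\right| && \text{(by statement 1. of Lemma 5.3)} \\
&\leq |Y_0| + \left|T_{\min(j^\delta, \ell)}\left(\bar{h}_j^{(k,\ell)}(Y_{-j+1}^{-1})\right)\right| \\
&\leq |Y_0| + \ell,
\end{aligned}$$

and thus $\mathbb{E}[\sup_j |f_j(Y)|] < \infty$. By identity (5.3),

$$f_j(Y) \to \rho_\tau\left(Y_0 - T_\ell\left(q_{\tau,\infty}^{(k,\ell)}(Y_{-k}^{-1})\right)\right) \quad \text{a.s. as } j \to \infty.$$

Consequently, Lemma 5.1 yields

$$L_n(h^{(k,\ell)}) \to \mathbb{E}\left[\rho_\tau\left(Y_0 - T_\ell\left(q_{\tau,\infty}^{(k,\ell)}(Y_{-k}^{-1})\right)\right)\right] \quad \text{a.s. as } n \to \infty.$$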
To lighten notation a bit, we set

$$\varepsilon_{k,\ell} = E\Big[\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)\Big]$$

and proceed now to prove that $\lim_{k\to\infty} \lim_{\ell\to\infty} \varepsilon_{k,\ell} \le L^*$.

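We have, a.s., in terms of weak convergence,

$$P_{\infty, Y_{-k}^{-1}}^{(k,\ell)} \to P_{Y_0 \mid Y_{-k}^{-1}} \quad \text{as } \ell \to \infty$$

(see for instance Theorem 3.1 in [18]). Next, with a slight modification of techniques of Theorem 2.2 in [14],

$$\int y^2 \, dP_{\infty, Y_{-k}^{-1}}^{(k,\ell)}(y) \to \int y^2 \, dP_{Y_0 \mid Y_{-k}^{-1}}(y) \quad \text{a.s. as } \ell \to \infty,$$

which leads to

$$\sup_{\ell \ge 0} \int y^2 \, dP_{\infty, Y_{-k}^{-1}}^{(k,\ell)}(y) < \infty \quad \text{a.s.}$$

Moreover, by assumption (H3), the distribution function of $P_{Y_0 \mid Y_{-k}^{-1}}$ is a.s. increasing. Thus, setting

$$\big\{q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big\} = Q_\tau\big(P_{\infty, Y_{-k}^{-1}}^{(k,\ell)}\big) \quad \text{and} \quad \big\{q_\tau^{(k)}(Y_{-k}^{-1})\big\} = Q_\tau\big(P_{Y_0 \mid Y_{-k}^{-1}}\big)$$

and applying Lemma 5.4 yields

$$q_{\tau,\ell}^{(k)}(Y_{-k}^{-1}) \to q_\tau^{(k)}(Y_{-k}^{-1}) \quad \text{a.s. as } \ell \to \infty.$$

Consequently,

$$\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big) \to \rho_\tau\Big(Y_0 - q_\tau^{(k)}(Y_{-k}^{-1})\Big) \quad \text{a.s. as } \ell \to \infty.$$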

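It turns out that the above convergence also holds in mean. To see this, note first that

$$\begin{aligned}
\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)
&= \rho_\tau\Big(Y_0 - T_\ell(Y_0) + T_\ell(Y_0) - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)\\
&\le \rho_\tau\big(Y_0 - T_\ell(Y_0)\big) + \rho_\tau\Big(T_\ell(Y_0) - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big) && \text{(by statement 2 of Lemma 5.3)}\\
&\le 2|Y_0| + \rho_\tau\Big(Y_0 - q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\Big) && \text{(by statement 3 of Lemma 5.3)}.
\end{aligned}$$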
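The bound just obtained, $\rho_\tau(y - T_\ell(q)) \le 2|y| + \rho_\tau(y - q)$, can be checked numerically. The Python sketch below is only a sanity check on randomly drawn points, not a proof; the helper names `pinball`, `truncate` and `worst_gap`, the sampling ranges and the tolerance are assumptions of this example, and the pinball loss is taken to be $\rho_\tau(u) = u(\tau - \mathbf{1}_{[u \le 0]})$.

```python
import random


def pinball(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1[u <= 0])."""
    return u * (tau - (1.0 if u <= 0 else 0.0))


def truncate(z, level):
    """Truncation T_level(z): clip z to the interval [-level, level]."""
    return max(-level, min(level, z))


def worst_gap(n_trials=10000, seed=0):
    """Largest observed value of rho(y - T(q)) - (2|y| + rho(y - q)) over random draws."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(n_trials):
        tau = rng.uniform(0.01, 0.99)
        level = rng.uniform(0.1, 5.0)
        y = rng.uniform(-10.0, 10.0)
        q = rng.uniform(-10.0, 10.0)
        lhs = pinball(y - truncate(q, level), tau)
        rhs = 2.0 * abs(y) + pinball(y - q, tau)
        worst = max(worst, lhs - rhs)
    return worst


# A non-positive gap (up to rounding) is consistent with the inequality.
print(worst_gap() <= 1e-12)
```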
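Thus

$$\begin{aligned}
E\Big[\Big(\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)\Big)^2\Big]
&\le E\Big[\Big(2|Y_0| + \rho_\tau\Big(Y_0 - q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\Big)\Big)^2\Big]\\
&\le 8E\big[Y_0^2\big] + 2E\Big[\Big(\rho_\tau\Big(Y_0 - q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\Big)\Big)^2\Big].
\end{aligned}$$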



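In addition,

$$\begin{aligned}
\sup_{\ell \ge 1} E\Big[\Big(\rho_\tau\Big(Y_0 - q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\Big)\Big)^2\Big]
&= \sup_{\ell \ge 1} E\Big[\Big(\min_{q(\cdot)} E_{P_{\infty, Y_{-k}^{-1}}^{(k,\ell)}}\big[\rho_\tau\big(Y_0 - q(Y_{-k}^{-1})\big)\big]\Big)^2\Big]\\
&\le E\big[(\rho_\tau(Y_0))^2\big] && \text{(by Jensen's inequality)}\\
&\le E\big[Y_0^2\big] < \infty.
\end{aligned}$$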



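This implies

$$\sup_{\ell \ge 1} E\Big[\Big(\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)\Big)^2\Big] < \infty,$$

i.e., the sequence is uniformly integrable. Thus we obtain, as desired,

$$\lim_{\ell\to\infty} E\Big[\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)\Big] = E\Big[\rho_\tau\Big(Y_0 - q_\tau^{(k)}(Y_{-k}^{-1})\Big)\Big].$$

Putting all pieces together,

$$\lim_{\ell\to\infty} \varepsilon_{k,\ell} = \lim_{\ell\to\infty} E\Big[\rho_\tau\Big(Y_0 - T_\ell\big(q_{\tau,\ell}^{(k)}(Y_{-k}^{-1})\big)\Big)\Big] = E\Big[\rho_\tau\Big(Y_0 - q_\tau^{(k)}(Y_{-k}^{-1})\Big)\Big] = \varepsilon_k^*.$$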



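It remains to prove that $\lim_{k\to\infty} \varepsilon_k^* = L^*$. To this aim, for all $k \ge 1$, let $Z_k$ be the $\sigma(Y_{-k}^{-1})$-measurable random variable defined by

$$Z_k = E_{P_{Y_0 \mid Y_{-k}^{-1}}}\Big[\rho_\tau\Big(Y_0 - q_\tau^{(k)}(Y_{-k}^{-1})\Big)\Big] = \min_{q(\cdot)} E_{P_{Y_0 \mid Y_{-k}^{-1}}}\big[\rho_\tau\big(Y_0 - q(Y_{-k}^{-1})\big)\big].$$

Observe that $\{Z_k\}_{k \ge 1}$ is a nonnegative supermartingale with respect to the family of sigma algebras $\{\sigma(Y_{-k}^{-1})\}_{k \ge 1}$. In addition,

$$\begin{aligned}
\sup_{k \ge 1} E\big[Z_k^2\big]
&= \sup_{k \ge 1} E\Big[\Big(\min_{q(\cdot)} E_{P_{Y_0 \mid Y_{-k}^{-1}}}\big[\rho_\tau\big(Y_0 - q(Y_{-k}^{-1})\big)\big]\Big)^2\Big]\\
&\le \sup_{k \ge 1} E\Big[\Big(E_{P_{Y_0 \mid Y_{-k}^{-1}}}\big[\rho_\tau(Y_0)\big]\Big)^2\Big]\\
&\le \sup_{k \ge 1} E\big[(\rho_\tau(Y_0))^2\big] && \text{(by Jensen's inequality)}\\
&\le \sup_{k \ge 1} E\big[Y_0^2\big] < \infty.
\end{aligned}$$

Therefore,

$$E[Z_k] \to E[Z_\infty] \quad \text{as } k \to \infty,$$

where

$$Z_\infty = \min_{q(\cdot)} E_{P_{Y_0 \mid Y_{-\infty}^{-1}}}\big[\rho_\tau\big(Y_0 - q(Y_{-\infty}^{-1})\big)\big].$$

Consequently,

$$\lim_{k\to\infty} \varepsilon_k^* = L^*.$$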

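We finish the proof by using Lemma 5.2. On the one hand, a.s.,

$$\begin{aligned}
\limsup_{n\to\infty}\, \inf_{k,\ell}\left( L_n\big(h^{(k,\ell)}\big) - \frac{2\ln b_{k,\ell}}{n\eta_{n+1}} \right)
&\le \inf_{k,\ell}\left( \limsup_{n\to\infty} \left( L_n\big(h^{(k,\ell)}\big) - \frac{2\ln b_{k,\ell}}{n\eta_{n+1}} \right) \right)\\
&\le \inf_{k,\ell}\left( \limsup_{n\to\infty} L_n\big(h^{(k,\ell)}\big) \right)\\
&= \inf_{k,\ell} \varepsilon_{k,\ell}\\
&\le \lim_{k\to\infty} \lim_{\ell\to\infty} \varepsilon_{k,\ell}\\
&\le L^*.
\end{aligned}$$

Moreover,

$$\begin{aligned}
\frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \Big[\rho_\tau\Big(Y_t - h_t^{(k,\ell)}\big(Y_1^{t-1}\big)\Big)\Big]^2
&\le \frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k=1}^{\infty} \sum_{\ell=1}^{\infty} p_{k,\ell,t} \Big[\rho_\tau\Big(Y_t - T_{\min(t^\delta,\ell)}\big(\bar h_t^{(k,\bar\ell)}(Y_1^{t-1})\big)\Big)\Big]^2\\
&\le \frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k=1}^{\infty} \sum_{\ell=1}^{\infty} p_{k,\ell,t} \Big|Y_t - T_{\min(t^\delta,\ell)}\big(\bar h_t^{(k,\bar\ell)}(Y_1^{t-1})\big)\Big|^2.
\end{aligned}$$

Thus

$$\begin{aligned}
\frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \Big[\rho_\tau\Big(Y_t - h_t^{(k,\ell)}\big(Y_1^{t-1}\big)\Big)\Big]^2
&\le \frac{1}{n} \sum_{t=1}^{n} \eta_t \sum_{k=1}^{\infty} \left( \sum_{\ell=1}^{\infty} p_{k,\ell,t} \big(\min(t^\delta, \ell)\big)^2 + \sum_{\ell=1}^{\infty} p_{k,\ell,t}\, Y_t^2 \right)\\
&\le \frac{1}{n} \sum_{t=1}^{n} \eta_t \sum_{k=1}^{\infty} \left( \sum_{\ell=1}^{\infty} p_{k,\ell,t}\, t^{2\delta} + \sum_{\ell=1}^{\infty} p_{k,\ell,t}\, Y_t^2 \right)\\
&= \frac{1}{n} \sum_{t=1}^{n} \eta_t \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \big(t^{2\delta} + Y_t^2\big)\\
&= \frac{1}{n} \sum_{t=1}^{n} \eta_t \big(t^{2\delta} + Y_t^2\big).
\end{aligned}$$

Therefore, since $n^{2\delta} \eta_n \to 0$ as $n \to \infty$ and $E[Y_0^2] < \infty$,

$$\limsup_{n\to\infty} \frac{1}{2n} \sum_{t=1}^{n} \eta_t \sum_{k,\ell=1}^{\infty} p_{k,\ell,t} \Big[\rho_\tau\Big(Y_t - h_t^{(k,\ell)}\big(Y_1^{t-1}\big)\Big)\Big]^2 = 0 \quad \text{a.s.}$$

Putting all pieces together, we obtain, a.s.,

$$\limsup_{n\to\infty} L_n(g) \le L^*,$$

and this proves the result.

□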
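To give a feel for the last step of the proof, the Python sketch below evaluates the averaged remainder $\frac{1}{n}\sum_{t=1}^{n}\eta_t\,(t^{2\delta}+Y_t^2)$ for growing $n$. The learning rate $\eta_t = t^{-1/2}$, the exponent $\delta = 0.1$ (so that $n^{2\delta}\eta_n \to 0$) and the i.i.d. standard Gaussian $Y_t$ are illustrative assumptions, not choices taken from the text, and `remainder` is a hypothetical helper name.

```python
import random


def remainder(n, delta=0.1, seed=0):
    """Average (1/n) * sum_t eta_t * (t^(2*delta) + Y_t^2) with eta_t = t^(-1/2)."""
    rng = random.Random(seed)
    total = 0.0
    for t in range(1, n + 1):
        eta_t = t ** -0.5           # assumed learning rate
        y_t = rng.gauss(0.0, 1.0)   # assumed i.i.d. N(0, 1) observations
        total += eta_t * (t ** (2 * delta) + y_t ** 2)
    return total / n


for n in (100, 10000):
    print(n, remainder(n))
```

Under these assumptions the printed values shrink as $n$ grows, in line with the almost sure convergence to zero used above.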



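5.2 Proof of Lemma 2.1

To prove the first statement of the lemma, it will be enough to show that, for all $q \in \mathbb{R}$,

$$E[\rho_\tau(Y - q)] - E[\rho_\tau(Y - q_\tau)] \ge 0.$$

We separate the cases $q \ge q_\tau$ and $q < q_\tau$.

(i) If $q \ge q_\tau$, then

$$\begin{aligned}
E[\rho_\tau(Y - q)] - E[\rho_\tau(Y - q_\tau)]
&= E\big[(Y - q)(\tau - \mathbf{1}_{[Y \le q]}) - (Y - q_\tau)(\tau - \mathbf{1}_{[Y \le q_\tau]})\big]\\
&= E\big[(Y - q)\big(\tau - (\mathbf{1}_{[Y \le q_\tau]} + \mathbf{1}_{[q_\tau < Y \le q]})\big) - (Y - q_\tau)(\tau - \mathbf{1}_{[Y \le q_\tau]})\big]\\
&= E\big[(q_\tau - q)(\tau - \mathbf{1}_{[Y \le q_\tau]})\big] - E\big[(Y - q)\mathbf{1}_{[q_\tau < Y \le q]}\big].
\end{aligned}$$

We have

$$\begin{aligned}
E\big[(q_\tau - q)(\tau - \mathbf{1}_{[Y \le q_\tau]})\big] &= (q_\tau - q)\big(\tau - P[Y \le q_\tau]\big)\\
&= (q_\tau - q)\big[\tau - F_Y\big(F_Y^{\leftarrow}(\tau)\big)\big]\\
&\ge 0
\end{aligned}$$

and, clearly,

$$-E\big[(Y - q)\mathbf{1}_{[q_\tau < Y \le q]}\big] \ge 0.$$

This proves the desired statement.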

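(ii) If $q < q_\tau$, then

$$\begin{aligned}
E[\rho_\tau(Y - q)] - E[\rho_\tau(Y - q_\tau)]
&= E\big[(Y - q)(\tau - \mathbf{1}_{[Y \le q]}) - (Y - q_\tau)(\tau - \mathbf{1}_{[Y \le q_\tau]})\big]\\
&= E\big[(Y - q)(\tau - \mathbf{1}_{[Y \le q]}) - (Y - q_\tau)\big(\tau - (\mathbf{1}_{[Y \le q]} + \mathbf{1}_{[q < Y \le q_\tau]})\big)\big]\\
&= E\big[(q_\tau - q)(\tau - \mathbf{1}_{[Y \le q]})\big] + E\big[(Y - q_\tau)\mathbf{1}_{[q < Y \le q_\tau]}\big].
\end{aligned}$$

Writing $(Y - q_\tau)\mathbf{1}_{[q < Y \le q_\tau]} = (Y - q)\mathbf{1}_{[q < Y < q_\tau]} - (q_\tau - q)\mathbf{1}_{[q < Y < q_\tau]}$, the right-hand side equals

$$(q_\tau - q)\big(\tau - P[Y < q_\tau]\big) + E\big[(Y - q)\mathbf{1}_{[q < Y < q_\tau]}\big].$$

By definition of $q_\tau$, $F_Y(t) < \tau$ for all $t < q_\tau$, so that $P[Y < q_\tau] \le \tau$. Consequently

$$(q_\tau - q)\big(\tau - P[Y < q_\tau]\big) \ge 0.$$

Since

$$E\big[(Y - q)\mathbf{1}_{[q < Y < q_\tau]}\big] \ge 0,$$

we are led to the desired result.

Suppose now that $F_Y$ is increasing. To establish the second statement of the lemma, a quick inspection of the proof reveals that it is enough to prove that, for $q > q_\tau$,

$$E\big[(Y - q)\mathbf{1}_{[q_\tau < Y \le q]}\big] < 0,$$

and that, for $q < q_\tau$, $E\big[(Y - q)\mathbf{1}_{[q < Y < q_\tau]}\big] > 0$. For $q > q_\tau$, take $q' \in (q_\tau, q)$ and set $S = [q_\tau < Y \le q']$. Clearly,

$$P(S) = F_Y(q') - F_Y(q_\tau) > 0.$$

Therefore

$$E\big[(Y - q)\mathbf{1}_{[q_\tau < Y \le q]}\big] \le E\big[(Y - q)\mathbf{1}_S\big] < 0.$$

The case $q < q_\tau$ is handled symmetrically, with $S = [q' < Y \le q'']$ for some $q < q' < q'' < q_\tau$.

□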
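The statement of the lemma is easy to check numerically on a small discrete distribution. The Python sketch below is an illustration under assumed data: the support points, probabilities, the level $\tau = 0.75$ and the grid are hypothetical, and `expected_loss` and `quantile` are helper names introduced for this example. It compares the generalized inverse $q_\tau = \inf\{t : F_Y(t) \ge \tau\}$ with a brute-force minimizer of the expected pinball loss.

```python
def pinball(u, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1[u <= 0])."""
    return u * (tau - (1.0 if u <= 0 else 0.0))


def expected_loss(q, tau, support, probs):
    """E[rho_tau(Y - q)] for a discrete random variable Y."""
    return sum(p * pinball(y - q, tau) for y, p in zip(support, probs))


def quantile(tau, support, probs):
    """Smallest support point t with F_Y(t) >= tau (support sorted increasingly)."""
    cumulative = 0.0
    for y, p in zip(support, probs):
        cumulative += p
        if cumulative >= tau:
            return y
    return support[-1]


support = [0.0, 1.0, 2.0, 5.0]   # assumed toy distribution
probs = [0.1, 0.4, 0.3, 0.2]
tau = 0.75

q_tau = quantile(tau, support, probs)
grid = [i / 100.0 for i in range(-100, 701)]
best = min(grid, key=lambda q: expected_loss(q, tau, support, probs))
print(q_tau, best)
```

On this example the brute-force minimizer over the grid coincides with $q_\tau$, and no grid point achieves a smaller expected loss.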